Loading the libraries used for data import, analysis, plotting, distribution fitting, string matching, and pretty-printing of data frames and tables:

library(dplyr)
library(data.table)
library(plotly)
library(MASS)
library(stringr)
library(knitr)

Task 1

Assignment

Logistics plays an increasingly important role in the product development of the automobile industry. Parts produced by a supplier must first be delivered to the OEM before they can be installed. What seems logical at first sight should be analyzed in a more detailed way for a professional application. Therefore, create a distribution for the logistics delay of component “K7”. Use the production date (“Produktionsdatum”) from the data set “Komponente_K7.csv” and the receiving date of incoming goods (“Wareneingang”) from “Logistikverzug_K7.csv” (logistics delay). You can assume that produced goods are issued one day after the production date. For the model design in R, create a new data set “Logistics delay” that contains the required information from both data sets.

  1. How is the logistics delay distributed? Justify your choice with statistical tests and briefly describe your approach.
  2. Determine the mean of the logistics delay (watch out for weekends). Please interpret this number and discuss possible alternatives.
  3. Visualize the distribution in an appropriate way by displaying the histogram and the density function using “plotly”. Please describe how you selected the size of the bins.
  4. Please describe how you would proceed if you had to create a decision tree describing the classification problem of whether the component (K7) is defective (“Fehlerhaft”) or not. (Hint: You might want to work with visualizations.)

Preparation of analysis

Importing the data files. Komponente_K7.csv contains the production dates of the components, while Logistikverzug_K7.csv contains the date of arrival at the next production stage. The first column of both files is dropped since it contains only line numbers. Information about defects is stored as the integers 0 (false) and 1 (true); fread() therefore imports it as an integer column, which is converted to a logical value (TRUE or FALSE).

production <- fread(file.path("Data","Logistikverzug","Komponente_K7.csv"), header=TRUE, drop=1) %>%
  mutate(Fehlerhaft = as.logical(Fehlerhaft))
arrival <- fread(file.path("Data","Logistikverzug","Logistikverzug_K7.csv"), header=TRUE, drop=1) %>%
  mutate(Fehlerhaft = as.logical(Fehlerhaft))

The two tables with the data imported from the files are now joined into one along the vehicle ID (“IDNummer”).

logistics <- production %>%
    inner_join(arrival,by="IDNummer",suffix=c(".Komponente",".OEM"))

Since part manufacturer, OEM and logistics provider work seven days a week, weekends are of no importance to the logistic delay. It can therefore be calculated simply as the number of days between production and arrival at the OEM. Note that this figure still includes the one day between production and goods issue; subtracting it would shift every value, and thus the mean, down by exactly one day without changing the shape of the distribution.

logistics <- logistics %>% 
    mutate(logistic_delay = as.numeric(Wareneingang) - as.numeric(Produktionsdatum))

mean_delay <- mean(logistics$logistic_delay)
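As a quick illustration of this date arithmetic (with hypothetical dates, not taken from the data): IDate values are stored as integer day counts, so their difference is a whole number of days.

```r
library(data.table)

# Hypothetical dates one week apart; as.IDate() stores them as integer day counts
produced <- as.IDate("2010-08-01")
arrived  <- as.IDate("2010-08-08")

as.numeric(arrived) - as.numeric(produced)  # 7
```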

The (arithmetic) mean logistic delay is calculated to be 7.0804366 days: on average, a component is in transit for about one week. As an alternative, the median is computed, which makes the analysis less susceptible to outliers with an unusually long logistic delay, such as those caused by the delivery problems during the COVID-19 pandemic. This is especially relevant because outliers to the right cannot be compensated by outliers to the left: there can be no negative logistic delay (provided there are no significant breakthroughs in time travel research).

median(logistics$logistic_delay)
## [1] 7
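The robustness argument can be sketched with hypothetical delay values (not the real data): a single extreme outlier pulls the mean noticeably to the right while leaving the median untouched.

```r
# Hypothetical delays in days
delays         <- c(6, 6, 7, 7, 7, 8, 8)
delays_outlier <- c(delays, 60)  # one pandemic-style extreme delay

mean(delays)            # 7
median(delays)          # 7
mean(delays_outlier)    # 13.625
median(delays_outlier)  # 7
```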

Analysis

The logistic delay is assumed to be either normally or log-normally distributed. This assumption stems from a preliminary analysis, which concluded that the empirical density function most closely resembles these two distributions. Both distributions are fitted to the data with the fitdistr() function from the MASS library.

lognorm_fit <- fitdistr(logistics$logistic_delay, densfun="lognormal")
norm_fit <- fitdistr(logistics$logistic_delay, densfun="normal")

The data is plotted as a histogram and as a line plot of its density function, together with the density functions of the fitted normal and log-normal distributions, as generated by dnorm() (normal) and dlnorm() (log-normal). While the histogram can be created fully automatically, the density function, which looks very similar, needs some preparatory work: the data is grouped by logistic delay and then counted, resulting in a table that lists how often each delay occurred. This is possible because the delay only takes discrete, integer values (whole days).

The plot shows relative frequencies, i.e. how often a certain delay occurs relative to the total number of deliveries. This means that both the total area of the blue histogram bars and the area under each density curve equal 1.
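As a sanity check on this normalisation (using a hypothetical miniature table), the relative frequencies obtained by dividing the group counts by the total number of rows sum to exactly 1:

```r
library(dplyr)

# Hypothetical delays in days
delays <- data.frame(logistic_delay = c(6, 7, 7, 8, 8, 8))

counts <- delays %>% group_by(logistic_delay) %>% count()
sum(counts$n / nrow(delays))  # 1
```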

Since the data only has a resolution of one day, the automatically selected bin size of 1 day is sufficient. This leads to an acceptable number of 12 bins.

density <- logistics %>% group_by(logistic_delay) %>% count()
fig <- plot_ly(x = ~logistics$logistic_delay, 
               type="histogram", 
               histnorm='probability density', 
               name="Original data",
               width=800, height=400) %>% 
  add_lines(x = ~density$logistic_delay, 
            y=~density$n/nrow(logistics),
            name="Density function",
            line=list(width=5)) %>%
  add_lines(x=seq(4,12,0.1), 
            y=dlnorm(seq(4,12,0.1), meanlog = lognorm_fit$estimate["meanlog"], sdlog =lognorm_fit$estimate["sdlog"]),
            name="Log normal distribution",
            line=list(dash="dash",
                      color="black")) %>%
  add_lines(x=seq(4,12,0.1), 
            y=dnorm(seq(4,12,0.1), mean = norm_fit$estimate["mean"], sd =norm_fit$estimate["sd"]),
            name="Normal distribution",
            line=list(dash="dot",
                      color="black")) %>%
  layout(xaxis = list(dtick = 1, title="Logistic delay (in days)"),
         yaxis = list(range=c(0,0.5), title="Relative frequency"))

fig

Both the normal and log normal distributions describe the data reasonably well and their plots follow the original density function plot closely. The plot does not allow a decision as to which distribution is the better fit. The results become much clearer, however, when examining the parameters of the fits.

lognorm_fit
##      meanlog         sdlog    
##   1.9473378003   0.1409926029 
##  (0.0002546761) (0.0001800832)
norm_fit
##       mean           sd     
##   7.080436556   1.012299959 
##  (0.001828526) (0.001292963)

The second line of each output shows the standard errors of the estimated parameters. Here it becomes apparent that the errors for the log-normal distribution are nearly an order of magnitude smaller, signifying a more precise fit.
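One way to back this comparison with a formal statistical test would be a Kolmogorov-Smirnov test against the fitted distribution. The sketch below uses a synthetic log-normal sample (roughly matching the parameters above), since the real delays are not reproduced here; note that ks.test() is only approximate when the parameters were estimated from the same sample and the data is discrete.

```r
library(MASS)  # for fitdistr()

set.seed(42)
# Synthetic stand-in sample, roughly matching the fitted parameters above
delays <- rlnorm(5000, meanlog = 1.95, sdlog = 0.14)

fit <- fitdistr(delays, densfun = "lognormal")

# H0: the sample follows the fitted log-normal distribution
ks.test(delays, "plnorm",
        meanlog = fit$estimate["meanlog"],
        sdlog   = fit$estimate["sdlog"])
```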

Deciding if a component is defective

The provided data includes two columns indicating whether a component is defective, one from the component manufacturer side and one from the OEM side. Since both are originally called “Fehlerhaft”, they are renamed “Fehlerhaft.Komponente” and “Fehlerhaft.OEM” in the joined dataset.

If both of these columns agree, the decision is easy. However, there could be cases where the classification on the component manufacturer and the OEM side disagree. To find out if these cases exist, the logistics table is grouped by the two columns and then counted. This displays all present combinations of defect classifications and how often these combinations occur.

knitr::kable(logistics %>% 
               group_by(Fehlerhaft.Komponente,Fehlerhaft.OEM) %>% 
               count()
             )
Fehlerhaft.Komponente  Fehlerhaft.OEM       n
FALSE                  FALSE           306484
TRUE                   FALSE                6

As we can see, most components were classified as “not defective” (FALSE) by the component manufacturer and the OEM. The 6 components that were classified as defective on the component manufacturer side were not classified as defective by the OEM. There are several possible explanations for this:

  1. The defects disappeared during transport. This is unlikely.
  2. The quality control (QC) on the component manufacturer side is stricter than on the OEM side. The manufacturer noticed a defect, but delivered the component anyway, probably hoping the OEM QC would not notice it, which it indeed did not.

Since none of the sides can be completely trusted (the manufacturer might cheat and the OEM might be less familiar with the details of what exactly can be defective in a component), the part should be considered defective if either of the sides finds a defect. The resulting decision tree looks like this:

consider defective if either Fehlerhaft.Komponente or Fehlerhaft.OEM are TRUE
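Expressed in code, this rule is a single logical disjunction. The sketch below applies it to a hypothetical miniature of the joined table (same column names as above):

```r
library(dplyr)

# Hypothetical miniature of the joined table
example <- data.frame(Fehlerhaft.Komponente = c(FALSE, TRUE,  FALSE),
                      Fehlerhaft.OEM        = c(FALSE, FALSE, TRUE))

# A component counts as defective if either side flags it
example <- example %>%
  mutate(Fehlerhaft = Fehlerhaft.Komponente | Fehlerhaft.OEM)

example$Fehlerhaft  # FALSE TRUE TRUE
```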

Task 2

Assignment

Why does it make sense to store the available data in separate files instead of saving everything in one huge table? Name at least four benefits. The available tables represent a typical database structure. What is this structure called?

Benefits of separate data files

  • Analyses which do not require all the data can selectively load only the relevant files, saving loading time and RAM.
  • Separate files allow each piece of information to be stored exactly once, whereas in one huge table much of the information on vehicles and components would have to be stored multiple times. This has several benefits:
    • Reducing the overall size of the dataset (although this is partly offset by the additional ID columns)
    • Avoiding inconsistencies when updating data, since each fact only needs to be changed in one place
  • Data providers can be granted selective access to the respective files, keeping permissions restrictive.

This structure, in which each entity type is stored in its own table and the tables are linked via ID columns, is that of a relational (normalized) database.
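The "store each fact exactly once" point can be illustrated with two hypothetical miniature tables: vehicle attributes live in one table, registrations in another, and the combined view is derived by a join only when it is needed.

```r
library(dplyr)

# Hypothetical miniature tables; each fact is stored exactly once
vehicles      <- data.frame(IDNummer = c("V1", "V2"),
                            Modell   = c("Typ11", "Typ21"))
registrations <- data.frame(IDNummer  = c("V1", "V2"),
                            Gemeinden = c("ADELSHOFEN", "ASCHERSLEBEN"))

# The combined view is derived, not stored
combined <- vehicles %>% inner_join(registrations, by = "IDNummer")
```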

Task 3

Assignment

How many of the parts T16 ended up in vehicles registered in Adelshofen?

Importing and filtering the relevant data

Part 16 is used in the components K2LE2 and K2ST2 (seats). Therefore, this analysis needs:

  • The vehicle -> component files for all four vehicle models. These all have the same column names, so they are loaded together by applying fread() to the four files with lapply(). The results are then combined into one table, which is filtered for vehicles containing the components K2LE2 or K2ST2.
vehicle_component_files <- c(file.path("Data","Fahrzeug","Bestandteile_Fahrzeuge_OEM1_Typ11.csv"),
                             file.path("Data","Fahrzeug","Bestandteile_Fahrzeuge_OEM1_Typ12.csv"),
                             file.path("Data","Fahrzeug","Bestandteile_Fahrzeuge_OEM2_Typ21.csv"),
                             file.path("Data","Fahrzeug","Bestandteile_Fahrzeuge_OEM2_Typ22.csv"))

vehicle_component <- lapply(vehicle_component_files, fread, header=TRUE, drop=1) %>%
  rbindlist()

vehicle_component_filtered <- vehicle_component %>%
  filter(str_detect(ID_Sitze,"K2LE2") | str_detect(ID_Sitze,"K2ST2"))

The resulting data table, vehicle_component_filtered, has 818844 entries, so the components containing part 16 have been installed 818844 times.

  • The registration data. After loading the data file with fread(), it is filtered for the place of registration “Adelshofen”. To prevent any problems with capitalisation, all entries in the relevant column “Gemeinden” are converted to upper case (which they should already be in anyway).
reg_data <- fread(file.path("Data","Zulassungen","Zulassungen_alle_Fahrzeuge.csv"), header=TRUE, drop=1) %>%
  mutate(Gemeinden = str_to_upper(Gemeinden))

reg_data_filtered <- reg_data %>%
  filter(str_detect(Gemeinden,"ADELSHOFEN"))

These tables are now joined into one along the vehicle IDs. This final, complete table is called reg_veh_comp, because it links registration data with the corresponding vehicles and components.

reg_veh_comp <- reg_data_filtered %>%
  inner_join(vehicle_component_filtered, by=c(IDNummer="ID_Fahrzeug"))

Finding the number of vehicles with part 16 registered in Adelshofen

The complete table has 96 entries, so 96 vehicles using part 16 were registered in Adelshofen.

There is one caveat. Two municipalities in Germany are named “Adelshofen”. The larger one, Adelshofen in county Fürstenfeldbruck near Munich, is called “ADELSHOFEN” in the registration data, while the smaller one, Adelshofen in county Ansbach near Rothenburg ob der Tauber, is included as “ADELSHOFEN1”. If we group the table reg_veh_comp by the column “Gemeinden”, we can see how many of the vehicles were registered in each of the two towns.

knitr::kable(reg_veh_comp %>% group_by(Gemeinden) %>% count())
Gemeinden     n
ADELSHOFEN   48
ADELSHOFEN1  48

We can see that in each of the two towns called “Adelshofen”, 48 vehicles with part 16 were registered.

Task 4

Assignment

Which data types do the attributes of the registration table “Zulassungen_aller_Fahrzeuge” have? Put your answers into a table which is integrated into your Markdown document and describe the characteristics of the data type(s).

Data types of Zulassungen_aller_Fahrzeuge.csv

The CSV file format does not technically prescribe data types, since everything is represented in text format and could theoretically be imported as characters, but the format of the data strongly implies certain data types. Import functions like fread() detect the implied formats and convert the data accordingly.

Column name          Data type                   Explanation
[Column 1]           int (integer)               A whole number without a fractional part; it can be positive, negative or zero.
IDNummer, Gemeinden  chr (character)             Data in text form. This can be any kind of text of arbitrary length; since most data can be represented as text, this type is very versatile.
Zulassung            IDate (integer-based date)  A date with a precision of one day, represented internally as an integer count of days since a fixed epoch (1970-01-01 in R).
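The detected types can be verified after import with class(); the sketch below builds a hypothetical miniature table with the same three column types as the registration table.

```r
library(data.table)

# Hypothetical miniature mirroring the column types of the registration table
dt <- data.table(V1        = 1:2,                 # integer index column
                 IDNummer  = c("a", "b"),         # character
                 Zulassung = as.IDate(c("2009-01-02", "2009-01-03")))

sapply(dt, function(col) class(col)[1])  # "integer" "character" "IDate"
```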

Task 5

Assignment

You want to publish your application. Why does it make sense to store the records on the database of a server? Why can’t you store the records on your personal computer? What is an easy way to make your application available to your customers? Please name 4 aspects.

Making the application available to the public

Storing the data on a local PC means the PC would have to act as a server, since otherwise it cannot make the data available to other users. While software for this purpose is readily available, this would require the PC to run permanently and stay online. Depending on the number of users, a PC might also be too weak to serve many simultaneous connections without disturbing the normal work of its owner. There are also security risks involved: a malicious user could gain access to other confidential and/or personal data stored on the PC.

For these reasons, the application should be made accessible via a dedicated server. The Shiny framework is already based on HTML, so the app can easily be published as a website. Hosting it on a web server also prevents users from gaining direct access to the precise data on the logistics chain, which is likely a trade secret, since the file Final_Data_Group_11.csv is only readable by the server software and never delivered to the client.

Task 6

Assignment

On 11.08.2010 there was a hit-and-run accident. There is no trace of the license plate of the car involved in the accident. The police ask for your help, as you work for the Federal Motor Transport Authority, and want to know where the vehicle with the body part number “K5-112-1122-79” was registered.

Identifying a vehicle by its body part number

It is assumed the police request is for the purpose of criminal prosecution and therefore legal under § 98c StPO.

The relevant data files have already been imported for task 3. First, the column “ID_Karosserie” (which contains body part numbers) of table vehicle_component, linking vehicles to their components, is filtered for the given body part number, to find the corresponding vehicle ID.

accident_vehicle_components <- vehicle_component %>%
  filter(ID_Karosserie == "K5-112-1122-79")

The vehicle in the accident was the one with the ID 12-1-12-82. To find its place of registration, the registration data in reg_data is now filtered for this ID.

accident_vehicle_registration <- reg_data %>%
  filter(IDNummer == accident_vehicle_components$ID_Fahrzeug)

The vehicle involved in the accident was registered on 02 Jan 2009 in ASCHERSLEBEN.